Methods, Systems, and Software for Identifying Functional Bio-Molecules

ABSTRACT

The present invention generally relates to methods of rapidly and efficiently searching biologically-related data space. More specifically, the invention includes methods of identifying bio-molecules with desired properties, or which are most suitable for acquiring such properties, from complex bio-molecule libraries or sets of such libraries. The invention also provides methods of modeling sequence-activity relationships. As many of the methods are computer-implemented, the invention additionally provides digital systems and software for performing these methods.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. patent application Ser. No.11/981,578 [Atty Ref: MXGNP004X2C1], filed Oct. 30, 2007, naming RichardJohn Fox as inventor, which is a continuation of U.S. patent applicationSer. No. 10/874,802 [Atty Ref: MXGNP004X2], filed Jun. 22, 2004, namingRichard John Fox as inventor, which is a continuation-in-part of U.S.patent application Ser. No. 10/629,351 [Atty Ref: MXGNP004X1], filedJul. 29, 2003, naming Richard John Fox as inventor, which is acontinuation-in-part of U.S. patent application Ser. No. 10/379,378[Atty Ref: MXGNP004], filed Mar. 3, 2003, naming Gustaffson et al. asinventors which in turn claims priority under 35 USC 119(e) from U.S.Provisional Application No. 60/360,982, filed Mar. 1, 2002. Each ofthese documents is incorporated herein by reference in its entirety andfor all purposes.

BACKGROUND

The present invention relates to the fields of molecular biology,molecular evolution, bioinformatics, and digital systems. Morespecifically, the invention relates to methods for computationallypredicting the activity of a biomolecule. Systems, including digitalsystems, and system software for performing these methods are alsoprovided. Methods of the present invention have utility in theoptimization of proteins for industrial and therapeutic use.

Protein design has long been known to be a difficult task if for noother reason than the combinatorial explosion of possible molecules thatconstitute searchable sequence space. The sequence space of proteins isimmense and is impossible to explore exhaustively. Because of thiscomplexity, many approximate methods have been used to design betterproteins; chief among them is the method of directed evolution. Directedevolution of proteins is today dominated by various high throughputscreening and recombination formats, often performed iteratively.

In parallel, various computational techniques have been proposed forexploring sequence-activity space. Relatively speaking, these techniquesare in their infancy and significant advances are still needed.Accordingly, new ways to efficiently search sequence space to identifyfunctional proteins would be highly desirable.

SUMMARY

The present invention provides techniques for generating and usingmodels that employ non-linear terms, particularly terms that account forinteractions between two or more residues in the sequence. Thesenon-linear terms may be “cross product” terms that involvemultiplication of two or more variables, each representing the presence(or absence) of the residues participating in the interaction. In someembodiments, the invention involves techniques for selecting thenon-linear terms that best describe the activity of the sequence. Notethat there are often far more possible non-linear interaction terms thanthere are true interactions between residues. Hence, to avoidoverfitting, only a limited number of non-linear are typically employedand those employed should reflect interactions that affect activity.

One aspect of the invention provides a method for identifying amino acidresidues for variation in a protein variant library. This method may becharacterized by the following operations: (a) receiving datacharacterizing a training set of a protein variant library, (b) from thedata, developing a sequence-activity model that predicts activity as afunction of amino acid residue type and corresponding position in aprotein sequence; and (c) using the sequence-activity model to identifyone or more amino acid residues at specific positions for variation toimpact the desired activity. The sequence-activity model includes one ormore non-linear terms, each of which represents an interaction betweentwo or more amino acid residues in the protein sequence. The trainingset data provides activity and sequence information for each proteinvariant in the training set.

The protein variant library may include proteins from various sources.In one example, the members include naturally occurring proteins such asthose encoded by members of a single gene family. In another example,the members include proteins obtained by using a recombination-baseddiversity generation mechanism. For example, DNA fragmentation-mediatedrecombination, synthetic oligonucleotide-mediated recombination or acombination thereof may be performed on nucleic acids encoding all orpart of one or more naturally occurring parent proteins for thispurpose. In still another example, the members are obtained byperforming DOE to identify the systematically varied sequences.

In some embodiments, at least one non-linear term is a cross-productterm containing a product of one variable representing the presence ofone interacting residue and another variable representing the presenceof another interacting residue. The form of the sequence-activity modelmay be a sum of at least one cross-product term and one or more linearterms, with each of the linear terms representing the presence of avariable residue in the training set. The cross-product terms may beselected from a group of potential cross-product terms by varioustechniques including, for example, running a genetic algorithm to selectthe cross-product terms based upon the predictive ability of variousmodels employing different cross-product terms.

The sequence activity model may be produced from the training set bymany different techniques. In a preferred embodiment, the model is aregression model such as a partial least squares model or a principalcomponent regression model. In another example, the model is a neuralnetwork.

In some embodiments, the method also includes (d) using the sequenceactivity model to identify one or more amino acid residues that are toremain fixed (as opposed to being varied) in new protein variantlibrary.

Using the sequence activity model to identify residues for fixing orvariation may involve any of many different possible analyticaltechniques. In some cases, a “reference sequence” is used to define thevariations. Such sequence may be one predicted by the model to have ahighest value (or one of the highest values) of the desired activity. Inanother case, the reference sequence may be that of a member of theoriginal protein variant library. From the reference sequence, themethod may select subsequences for effecting the variations. In additionor alternatively, the sequence activity model ranks residue positions(or specific residues at certain positions) in order of impact on thedesired activity.

One goal of the method may be to generate a new protein variant library.As part of this process, the method may identify sequences that are tobe used for generating this new library. Such sequences includevariations on the residues identified in (c) above or are precursorsused to subsequently introduce such variations. The sequences may bemodified by performing mutagenesis or a recombination-based diversitygeneration mechanism to generate the new library of protein variants.This may form part of a directed evolution procedure. The new librarymay also be used in developing a new sequence activity model. The newprotein variant library is analyzed to assess effects on a particularactivity such as stability, catalytic activity, therapeutic activity,resistance to a pathogen or toxin, toxicity, etc.

In some embodiments, the method involves selecting one or more membersof the new protein variant library for production. One or more of thesemay then be synthesized and/or expressed in an expression system. In aspecific embodiment, the method continues in the following manner: (i)providing an expression system from which a selected member of the newprotein variant library can be expressed; and (ii) expressing theselected member of the new protein variant library.

In some embodiments, rather than use amino acid sequences, the methodsemploy nucleotide sequences to generate the models and predict activity.Variations in groups of nucleotides, e.g., codons, affect the activityof peptides encoded by the nucleotide sequences. In some embodiments,the model may provide a bias for codons that are preferentiallyexpressed (compared to other codons encoding the same amino acid)depending upon the host employed to express the peptide.

Yet another aspect of the invention pertains to apparatus and computerprogram products including machine-readable media on which are providedprogram instructions and/or arrangements of data for implementing themethods and software systems described above. Frequently, the programinstructions are provided as code for performing certain methodoperations. Data, if employed to implement features of this invention,may be provided as data structures, database tables, data objects, orother appropriate arrangements of specified information. Any of themethods or systems of this invention may be represented, in whole or inpart, as such program instructions and/or data provided onmachine-readable media.

These and other features of the present invention will be described inmore detail below in the detailed description of the invention and inconjunction with the following figures.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flow chart depicting a sequence of operations, includingidentifying particular residues for variation, that may be used togenerate one or more generations of protein variant libraries.

FIG. 2 is a flow chart depicting a genetic algorithm for selectingnon-linear cross product terms in accordance with an embodiment of thisinvention.

FIGS. 3A-3F are graphs showing examples of this invention in which thepredictive capabilities of certain linear and non-linear models arecompared.

FIG. 4 is a flow chart depicting a bootstrap p-value method ofgenerating protein variant libraries in accordance with an embodiment ofthis invention.

FIG. 5 is a schematic of an example digital device.

DETAILED DISCUSSION OF THE INVENTION I. Definitions

Before describing the present invention in detail, it is to beunderstood that this invention is not limited to particular sequences,compositions, algorithms, or systems, which can, of course vary. It isalso to be understood that the terminology used herein is for thepurpose of describing particular embodiments only, and is not intendedto be limiting. As used in this specification and appended claims, thesingular forms “a”, “an”, and “the” include plural referents unless thecontent and context clearly dictates otherwise. Thus, for example,reference to “a device” includes a combination of two or more suchdevices, and the like. Unless indicated otherwise, an “or” conjunctionis intended to be used in its correct sense as a Boolean logicaloperator, encompassing both the selection of features in the alternative(A or B, where the selection of A is mutually exclusive from B) and theselection of features in conjunction (A or B, where both A and B areselected).

The following definitions and those included throughout this disclosuresupplement those known to persons of skill in the art.

A “bio-molecule” refers to a molecule that is generally found in abiological organism. Preferred biological molecules include biologicalmacromolecules that are typically polymeric in nature being composed ofmultiple subunits (i.e., “biopolymers”). Typical bio-molecules include,but are not limited to, molecules that share some structural featureswith naturally occurring polymers such as RNAs (formed from nucleotidesubunits), DNAs (formed from nucleotide subunits), and polypeptides(formed from amino acid subunits), including, e.g., RNAs, RNA analogues,DNAs, DNA analogues, polypeptides, polypeptide analogues, peptidenucleic acids (PNAs), combinations of RNA and DNA (e.g., chimeraplasts),or the like. Bio-molecules also include, e.g., lipids, carbohydrates, orother organic molecules that are made by one or more geneticallyencodable molecules (e.g., one or more enzymes or enzyme pathways) orthe like.

The term “nucleic acid” refers to deoxyribonucleotides orribonucleotides and polymers (e.g., oligonucleotides, polynucleotides,etc.) thereof in either single- or double-stranded form. Unlessspecifically limited, the term encompasses nucleic acids containingknown analogs of natural nucleotides that have similar bindingproperties as the reference nucleic acid and are metabolized in a mannersimilar to naturally occurring nucleotides. Unless otherwise indicated,a particular nucleic acid sequence also implicitly encompassesconservatively modified variants thereof (e.g., degenerate codonsubstitutions) and complementary sequences as well as the sequenceexplicitly indicated. Specifically, degenerate codon substitutions maybe achieved by generating sequences in which the third position of oneor more selected (or all) codons is substituted with mixed-base and/ordeoxyinosine residues (Batzer et al. (1991) Nucleic Acid Res. 19:5081;Ohtsuka et al. (1985) J. Biol. Chem. 260:2605-2608; Rossolini et al.(1994) Mol. Cell. Probes 8:91-98). The term nucleic acid is usedinterchangeably with, e.g., oligonucleotide, polynucleotide, cDNA, andmRNA.

A “nucleic acid sequence” refers to the order and identity of thenucleotides comprising a nucleic acid.

A “polynucleotide” is a polymer of nucleotides (A, C, T, U, G, etc. ornaturally occurring or artificial nucleotide analogues) or a characterstring representing a polymer of nucleotides, depending on context.Either the given nucleic acid or the complementary nucleic acid can bedetermined from any specified polynucleotide sequence.

The terms “polypeptide” and “protein” are used interchangeably herein torefer to a polymer of amino acid residues. Typically, the polymer has atleast about 30 amino acid residues, and usually at least about 50 aminoacid residues. More typically, they contain at least about 100 aminoacid residues. The terms apply to amino acid polymers in which one ormore amino acid residues are analogs, derivatives or mimetics ofcorresponding naturally occurring amino acids, as well as to naturallyoccurring amino acid polymers. For example, polypeptides can be modifiedor derivatized, e.g., by the addition of carbohydrate residues to formglycoproteins. The terms “polypeptide,” and “protein” includeglycoproteins, as well as non-glycoproteins.

A “motif” refers to a pattern of subunits in or among biologicalmolecules. For example, the motif can refer to a subunit pattern of theunencoded biological molecule or to a subunit pattern of an encodedrepresentation of a biological molecule.

“Screening” refers to the process in which one or more properties of oneor more bio-molecule is determined. For example, typical screeningprocesses include those in which one or more properties of one or moremembers of one or more libraries is/are determined.

The term “covariation” refers to the correlated variation of two or morevariables (e.g., amino acids in a polypeptide, etc.).

“Directed evolution” or “artificial evolution” refers to a process ofartificially changing a character string by artificial selection,recombination, or other manipulation, i.e., which occurs in areproductive population in which there are (1) varieties of individuals,with some varieties being (2) heritable, of which some varieties (3)differ in fitness (reproductive success determined by outcome ofselection for a predetermined property (desired characteristic). Thereproductive population can be, e.g., a physical population or a virtualpopulation in a computer system.

A “data structure” refers to the organization and optionally associateddevice for the storage of information, typically multiple “pieces” ofinformation. The data structure can be a simple recordation of theinformation (e.g., a list) or the data structure can contain additionalinformation (e.g., annotations) regarding the information containedtherein, can establish relationships between the various “members”(i.e., information “pieces”) of the data structure, and can providepointers or links to resources external to the data structure. The datastructure can be intangible but is rendered tangible when stored orrepresented in a tangible medium (e.g., paper, computer readable medium,etc.). The data structure can represent various informationarchitectures including, but not limited to simple lists, linked lists,indexed lists, data tables, indexes, hash indices, flat file databases,relational databases, local databases, distributed databases, thinclient databases, and the like. In preferred embodiments, the datastructure provides fields sufficient for the storage of one or morecharacter strings. The data structure is optionally organized to permitalignment of the character strings and, optionally, to store informationregarding the alignment and/or string similarities and/or stringdifferences. In one embodiment, this information is in the form ofalignment “scores” (e.g., similarity indices) and/or alignment mapsshowing individual subunit (e.g., nucleotide in the case of nucleicacid) alignments. The term “encoded character string” refers to arepresentation of a biological molecule that preserves desiredsequence/structural information regarding that molecule. As notedthroughout, non-sequence properties of bio-molecules can be stored in adata structure and alignments of such non-sequence properties, in amanner analogous to sequence based alignment can be practiced.

A “library” or “population” refers to a collection of at least twodifferent molecules, character strings, and/or models, such as nucleicacid sequences (e.g., genes, oligonucleotides, etc.) or expressionproducts (e.g., enzymes) therefrom. A library or population generallyincludes a number of different molecules. For example, a library orpopulation typically includes at least about 10 different molecules.Large libraries typically include at least about 100 differentmolecules, more typically at least about 1000 different molecules. Forsome applications, the library includes at least about 10000 or moredifferent molecules.

“Systematic variance” refers to different descriptors of an item or setof items being changed in different combinations.

“Systematically varied data” refers to data produced, derived, orresulting from different descriptors of an item or set of items beingchanged in different combinations. Many different descriptors can bechanged at the same time, but in different combinations. For example,activity data gathered from polypeptides in which combinations of aminoacids have been changed is systematically varied data.

The terms “sequence” and “character strings” are used interchangeablyherein to refer to the order and identity of amino acid residues in aprotein (i.e., a protein sequence or protein character string) or to theorder and identity of nucleotides in a nucleic acid (i.e., a nucleicacid sequence or nucleic acid character string).

II. Generating Improved Protein Variant Libraries

In accordance with the present invention, various methods are providedfor generating new protein variant libraries that can be used to exploreprotein sequence and activity space. A feature of many such methods is aprocedure for identifying amino acid residues in a protein sequence thatare predicted to impact a desired activity. As one example, suchprocedure includes the following operations:

-   -   (a) receiving data characterizing a training set of protein        variants, wherein the data provides activity and sequence        information for each protein variant in the training set;    -   (b) from the data, developing a sequence activity model that        predicts activity as a function of amino acid residue type and        corresponding position in the sequence (preferably the model        includes one or more non-linear terms, each representing an        interaction between two or more amino acid residues); and    -   (c) using the sequence activity model to identify one or more        amino acid residues at specific positions in one or more protein        variants that are to be varied in order to impact the desired        activity.

FIG. 1 presents a flow chart showing one application of the presentinvention. It presents various operations that may be performed in theorder depicted or in some other order. As shown, a process 01 begins ata block 03 with receipt of data describing a training set comprisingresidue sequences for a protein variant library. In other words, thetraining set data is derived from a protein variant library. Typicallythat data will include, for each protein in the library, a complete orpartial residue sequence together with an activity value. In some cases,multiple types of activities (e.g., rate constant data and thermalstability data) are provided together in the training set.

In many embodiments, the individual members of the protein variantlibrary represent a wide range of sequences and activities. This allowsone to generate a sequence-activity model having applicability over abroad region of sequence space. Techniques for generating such diverselibraries include systematic variation of protein sequences and directedevolution techniques. Both of these are described in more detailelsewhere herein. Note however that it is often desirable to generatemodels from gene sequences representing a particular gene family; e.g.,a particular kinase found in multiple species. As most residues will beidentical across all members of the family, the model describes onlythose residues that vary. Thus statistical models based on suchrelatively small training sets, compared to the set of all possiblevariants, are valid in a local sense. The goal is not to find a globalfitness function since that is beyond the capacity (and often the need)of the systems under consideration.

Activity data may be obtained by assays or screens appropriatelydesigned to measure activity magnitudes. Such techniques are well knownand are not central to this invention. The principles for designingappropriate assays or screens are widely understood. Techniques forobtaining protein sequences are also well known and are not central tothis invention. The activity used with this invention may be proteinstability (e.g., thermal stability). However, many important embodimentsconsider other activities such as catalytic activity, resistance topathogens and/or toxins, therapeutic activity, toxicity, and the like.

After the training set data has been generated or acquired, the processuses it to generate a sequence-activity model that predicts activity asa function of sequence information. See block 05. Such model is anon-linear expression, algorithm or other tool that predicts therelative activity of a particular protein when provided with sequenceinformation for that protein. In other words, protein sequenceinformation is input and an activity prediction is output. For manyembodiments of this invention, the model can also rank the contributionof various residues to activity. Methods of generating such models,which all fall under the rubric of machine learning, (e.g., partialleast squares regression (PLS), principal component regression (PCR),and multiple linear regression (MLR)) will be discussed below, alongwith the format of the independent variables (sequence information), theformat of the dependent variable(s) (activity), and the form of themodel itself (e.g., a linear first order expression).

A model generated at block 05 is employed to identify multiple residuepositions (e.g., position 35) or specific residue values (e.g. glutamineat position 35) that are predicted to impact activity. See block 07. Inaddition to identifying such positions, it may “rank” the residuepositions or residue values based on their contributions to activity.For example, the model may predict that glutamine at position 35 has themost pronounced, positive effect on activity, phenylalanine at position208 has the second most pronounced, positive effect, and so on. In aspecific approach described below, PLS or PCR regression coefficientsare employed to rank the importance of specific residues. In anotherspecific approach, a PLS load matrix is employed to rank the importanceof specific residue positions.

After the process has identified residues that impact activity, some ofthem are selected for variation as indicated at a block 09. This is donefor the purpose of exploring sequence space. Residues are selected usingany of a number of different selection protocols, some of which will bedescribed below. In one example, specific residues predicted to have themost beneficial impact on activity are preserved (i.e., not varied). Acertain number of other residues predicted to have a lesser impact are,however, selected for variation. In another example, the residuepositions found to have the biggest impact on activity are selected forvariation, but only if they are found to vary in high performing membersof the training set. For example, if the model predicts that residueposition 197 has the biggest impact on activity, but all or most of theproteins with high activity have leucine at this position, then position197 would not be selected for variation in this approach. In otherwords, all or most proteins in a next generation library would haveleucine at position 197. However, if some “good” proteins had valine atthis position and others had leucine, then the process would choose tovary the amino acid at this position. In some cases, it will be foundthat a combination of two or more interacting residues have the biggestimpact on activity. Hence, in some strategies, these residues areco-varied.

After the residues for variation have been identified, the method nextgenerates a new variant library having the specified residue variation.See block 11. Various methodologies are available for this purpose. Inone example, an in vitro or in vivo recombination-based diversitygeneration mechanism is performed to generate the new variant library.Such procedures may employ oligonucleotides containing sequences orsubsequences for encoding the proteins of the parental variant library.Some of the oligonucleotides will be closely related, differing only inthe choice of codons for alternate amino acids selected for variation at09. The recombination-based diversity generation mechanism may beperformed for one or multiple cycles. If multiple cycles are used, eachinvolves a screening step to identify which variants have acceptableperformance to be used in a next recombination cycle. This is a form ofdirected evolution.

In a different example, a “reference” protein sequence is chosen and theresidues selected at 09 are “toggled” to identify individual members ofthe variant library. The new proteins so identified are synthesized byan appropriate technique to generate the new library. In one example,the reference sequence may be a top-performing member of the trainingset or a “best” sequence predicted by a PLS or PCR model.

In another approach, the sequence activity model is used as a “fitnessfunction” in a genetic algorithm for exploring sequence space. After oneor more rounds of the genetic algorithm (with each round using thefitness function to select one or more possible sequences for a geneticoperation), a next generation library is identified for use as describedin this flow chart. In a very real sense this strategy can be viewed asin silico directed evolution. In an ideal case, if one had in hand anaccurate, precise global or local fitness function one could perform allthe evolution in silico and synthesize a single best variant for use inthe final commercial or research application. Though this is likely tobe impossible to achieve in most cases such a view of the process lendsclarity to the goals and approach of using machine learning techniquesfor directed evolution.

After the new library has been produced, it is screened for activity, asindicated in a block 13. Ideally, the new library will present one ormore members with better activity than was observed in the previouslibrary. However, even without such advantage, the new library canprovide beneficial information. Its members may be employed forgenerating improved models that account for the effects of thevariations selected in 09, and thereby more accurately predict activityacross wider regions of sequence space. Further, the library mayrepresent a passage in sequence space from a local maximum toward aglobal maximum (in activity).

Depending on the goal of process 01, it may be desirable to generate aseries of new protein variant libraries, with each one providing newmembers of a training set. The updated training set is then used togenerate an improved model. To this end, process 01 is shown with adecision operation 15, which determines whether yet another proteinvariant library should be produced. Various criteria can be used to makethis decision. Examples include the number of protein variant librariesgenerated so far, the activity of top proteins from the current library,the magnitude of activity desired, and the level of improvement observedin recent new libraries.

Assuming that the process is to continue with a new library, the processreturns to operation 05 where a new sequence-activity model is generatedfrom sequence and activity data obtained for the current protein variantlibrary. In other words, the sequence and activity data for the currentprotein variant library serves as part of the training set for the newmodel (or it may serve as the entire training set). Thereafter,operations 07, 09, 11, 13, and 15 are performed as described above, butwith the new model.

At some point, in process 01, this cycle will end and no new librarywill be generated. At that point, the process may simply terminate orone or more sequences from one or more of the libraries may be selectedfor development and/or manufacture. See block 17.

A. Choosing Protein Variant Libraries

Protein variant libraries are groups of multiple proteins having one ormore residues that vary from member to member in a library. They may begenerated by methods of this invention. They may provide the data fortraining sets used to generate sequence-activity models in accordancewith this invention. The number of proteins included in a proteinvariant library depends on the application and the cost.

In one example, the protein variant library is generated from one ormore naturally occurring proteins, which may be protein members encodedby a single gene family. Other starting points for the library may beused. From these seed or starting proteins, the library may be generatedby various techniques. In one case, the library is generated by DNAfragmentation-mediated recombination as described in Stemmer (1994)Proc. Natl. Acad. Sci. USA 10747-10751 and WO 95/22625, syntheticoligonucleotide-mediated recombination as described in Ness et al.(2002) Nature Biotechnology 20:1251-1255 and WO 00/42561) on nucleicacids encoding part or all of one or more parent proteins. A combinationof these methods may be used as well (i.e., recombination of DNAfragments and synthetic oligonucleotides) as well as otherrecombination-based methods described in, for example, WO97/20078 andWO98/27230.

In another case, a single starting sequence is modified in various waysto generate the library. Preferably the library is generated bysystematically varying the individual residues of the starting sequence.In one example, a design of experiment (DOE) methodology is employed toidentify the systematically varied sequences. In another example, a “wetlab” procedure such as oligonucleotide-mediated recombination is used tointroduce some level of systematic variation.

As used herein, the term “systematically varied sequences” refers to aset of sequences in which each residue is seen in multiple contexts. Inprinciple, the level of systematic variation can be quantified by thedegree to which the sequences are orthogonal from one another (maximallydifferent compared to the mean). In practice, the process does notdepend on having maximally orthogonal sequences, however, the quality ofthe model will be improved in direct relation to the orthogonality ofthe sequence space tested. In a simple example, a peptide sequence issystematically varied by identifying two residue positions, each ofwhich can have one of two different amino acids. A maximally diverselibrary includes all four possible sequences. Such maximal systematicvariation increases exponentially with the number of variable positions;e.g., by 2^(N), when there are 2 options at each of N residue positions.Those having ordinary skill in the art will readily recognize thatmaximal systematic variation, however, is not required by the inventionmethods. Systematic variation provides a mechanism for identifying arelatively small set of sequences for testing that provides a goodsampling of sequence space.

Protein variants having systematically varied sequences can be obtainedin a number of ways using techniques that are well known to those havingordinary skill in the art. As indicated, suitable methods includerecombination-based methods that generate variants based on one or more“parental” polynucleotide sequences. Polynucleotide sequences can berecombined using a variety of techniques, including, for example, DNAsedigestion of polynucleotides to be recombined followed by ligationand/or PCR reassembly of the nucleic acids. These methods include thosedescribed in, for example, Stemmer (1994) Proc. Natl. Acad. Sci. USA,91:10747-10751, U.S. Pat. No. 5,605,793, “Methods for In VitroRecombination,” U.S. Pat. No. 5,811,238, “Methods for GeneratingPolynucleotides having Desired Characteristics by Iterative Selectionand Recombination,” U.S. Pat. No. 5,830,721, “DNA Mutagenesis by RandomFragmentation and Reassembly,” U.S. Pat. No. 5,834,252, “EndComplementary Polymerase Reaction,” U.S. Pat. No. 5,837,458, “Methodsand Compositions for Cellular and Metabolic Engineering,” “WO98/42832,“Recombination of Polynucleotide Sequences Using Random or DefinedPrimers,” WO 98/27230, “Methods and Compositions for PolypeptideEngineering,” WO 99/29902, “Method for Creating Polynucleotide andPolypeptide Sequences,” and the like.

Synthetic recombination methods are also particularly well suited forgenerating protein variant libraries with systematic variation. Insynthetic recombination methods, a plurality of oligonucleotides aresynthesized which collectively encode a plurality of the genes to berecombined. Typically the oligonucleotides collectively encode sequencesderived from homologous parental genes. For example, homologous genes ofinterest are aligned using a sequence alignment program such as BLAST(Atschul, et al., J. Mol. Biol., 215:403-410 (1990). Nucleotidescorresponding to amino acid variations between the homologues are noted.These variations are optionally further restricted to a subset of thetotal possible variations based on covariation analysis of the parentalsequences, functional information for the parental sequences, selectionof conservative or non-conservative changes between the parentalsequences, or other like criteria. Variations are optionally furtherincreased to encode additional amino acid diversity at positionsidentified by, for example, covariation analysis of the parentalsequences, functional information for the parental sequences, selectionof conservative or non-conservative changes between the parentalsequences, or apparent tolerance of a position for variation. The resultis a degenerate gene sequence encoding a consensus amino acid sequencederived from the parental gene sequences, with degenerate nucleotides atpositions encoding amino acid variations. Oligonucleotides are designedwhich contain the nucleotides required to assemble the diversity presentin the degenerate gene. Details regarding such approaches can be foundin, for example, Ness et al. (2002), Nature Biotechnology 20:1251-1255,WO 00/42561, “Oligonucleotide Mediated Nucleic Acid Recombination,” WO00/42560, “Methods for Making Character Strings, Polynucleotides andPolypeptides having Desired Characteristics,” WO 01/75767, “In SilicoCross-Over Site Selection,” and WO 01/64864, “Single-Stranded NucleicAcid Template-Mediated Recombination and Nucleic Acid FragmentIsolation.” The identified polynucleotide variant sequences may betranscribed and translated, either in vitro or in vivo, to create a setor library of protein variant sequences.

The set of systematically varied sequences can also be designed a prioriusing design of experiment (DOE) methods to define the sequences in thedata set. A description of DOE methods can be found in Diamond, W. J.(2001) Practical Experiment Designs: for Engineers and Scientists, JohnWiley & Sons and in “Practical Experimental Design for engineers andscientists” by William J Drummond (1981) Van Nostrand Reinhold Co NewYork, “Statistics for experimenters” George E. P. Box, William G Hunterand J. Stuart Hunter (1978) John Wiley and Sons, New York, or, e.g., onthe world wide web at itl.nist.gov/div898/handbook/. There are severalcomputational packages available to perform the relevant mathematics,including Statistics Toolbox (MatLab), JMP, Statistica and StateaseDesign expert. The result is a systematically varied and orthogonaldispersed data set of sequences that is suitable for building thesequence activity model of the present invention. DOE-based data setscan be readily generated using either Plackett-Burman or FractionalFactorial designs. Id.

In engineering and chemical sciences, fractional factorial designs, forexample, are used to define fewer experiments (than in full factorialdesigns) in which a factor is varied (toggled) between two or morelevels. Optimization techniques are used to ensure that the experimentschosen are maximally informative in accounting for factor spacevariance. The same design approaches (e.g., fractional factorial,D-optimal design) can be applied in protein engineering to constructfewer sequences where a given number of positions are toggled betweentwo or more residues. This set of sequences would be an optimaldescription of systematic variance present in the protein sequence spacein question.

An example of the DOE approach applied to protein engineering includesthe following operations:

-   -   1) Identify positions to toggle based on the principles        described earlier (present in parental sequences, level of        conservation, etc.)    -   2) Create a DOE experiment using one of the commonly available        statistical packages by defining the number of factors (variable        positions), the number of levels (choices at each position), and        the number of experiments to run. The information content of the        output matrix (typically consisting of 1s and 0s that represent        residue choices at each position) depends directly on the number        of experiments to run (the more the better).    -   3) Use the output matrix to construct a protein alignment that        codes the 1s and 0s back to specific residue choices at each        position.    -   4) Synthesize the genes encoding the proteins represented in the        protein alignment.    -   5) Test the proteins encoded by the synthesized genes in        relevant assay(s).    -   6) Build a model on the tested genes/proteins.    -   7) Follow the steps described before to identify positions of        importance and to build a subsequent library with improved        fitness.

For example purposes, consider a protein in which the functionally bestamino acid residues at 20 positions are to be determined, e.g., wherethere are 2 possible amino acids available at each position. In thiscase, a resolution IV factorial design would be appropriate. Aresolution IV design is defined as one which is capable of elucidatingthe effects of all single variables, with no two-factor effectsoverlapping them. The design would then specify a set of 40 specificamino acid sequences that would cover the total diversity of 2²⁰ (˜1million) possible sequences. These sequences are then generated by astandard gene synthesis protocol and the function and fitness of theseclones is determined.

An alternative to the above approaches is to employ all availablesequences, e.g., the GenBank® database and other public sources, toprovide the protein variant library. Although this entails massivecomputational power, current technologies make the approach feasible.Mapping all available sequences provides an indication of sequence spaceregions of interest.

B. Generating a Sequence Activity Model & Using That Model to IdentifyResidue Positions for Variation

As indicated above, a sequence-activity model used with the presentinvention relates protein sequence information to protein activity. Theprotein sequence information used by the model may take many forms.Frequently, it is a complete sequence of the amino acid residues in aprotein; e.g., HGPVFSTGGA . . . . In some cases, however, it may beunnecessary to provide the complete amino acid sequence. For example, itmay be sufficient to provide only those residues that are to be variedin a particular research effort. At later stages in research, forexample, many residues may be fixed and only limited regions of sequencespace remain to be explored. In such situations, it may be convenient toprovide sequence activity models that require, as inputs, only theidentification of those residues in the regions of the protein where theexploration continues. Still further, some models may not require exactidentities of residues at the residue positions, but instead identifyone or more physical or chemical properties that characterize the aminoacid at a particular residue position. For example, the model mayrequire specification of residue positions by bulk, hydrophobicity,acidity, etc. In some models, combinations of such properties areemployed.

The form of the sequence-activity model can vary widely, so long as itprovides a vehicle for correctly approximating the relative activity ofproteins based on sequence information. Generally, it will treatactivity as a dependent variable and sequence/residue values asindependent variables. Examples of the mathematical/logical form ofmodels include linear and non-linear mathematical expressions of variousorders, neural networks, classification and regression trees/graphs,clustering approaches, recursive partitioning, support vector machines,and the like. In one preferred embodiment, the model form is a linearadditive model in which the products of coefficients and residue valuesare summed. In another preferred embodiment, the model form is anon-linear product of various sequence/residue terms, including certainresidue cross products (which represent interaction terms betweenresidues).

Models are developed from a training set of activity versus sequenceinformation to provide the mathematical/logical relationship betweenactivity and sequence. This relationship is typically validated prior touse for predicting activity of new sequences or residue importance.

Various techniques for generating models are available. Frequently, suchtechniques are optimization or minimization techniques. Specificexamples include partial least squares, various other regressiontechniques, as well as genetic programming optimization techniques,neural network techniques, recursive partitioning, support vectormachine techniques, CART (classification and regression trees), and/orthe like. Generally, the technique should produce a model that candistinguish residues that have a significant impact on activity fromthose that do not. Preferably, the model should also rank individualresidues or residue positions based on their impact on activity.

In one important class of techniques, models are generated by aregression technique that identifies covariation of independent anddependent variables in a training set. Various regression techniques areknown and widely used. Examples include multiple linear regression(MLR), principal component regression (PCR) and partial least squaresregression (PLS).

MLR is the most basic of these techniques. It simply solves a set ofcoefficient equations for members of a training set. Each equationrelates to the activity of a training set member (dependent variable)with the presence or absence of a particular residue at a particularposition (independent variables). Depending upon the number of residueoptions in the training set, these expressions can be quite large.

Like MLR, PLS and PCR generate models from equations relating sequenceactivity to residue values. However, these techniques do so in adifferent manner. They first perform a coordinate transformation toreduce the number of independent variables. They then perform theregression on the transformed variables. In MLR, there are a potentiallyvery large number of independent variables: two or more for each residueposition that varies within the training set. Given that proteins andpeptides of interest are often quite large and the training set mayprovide many different sequences, the number of independent variablescan quickly become very large. By reducing the number of variables tofocus on those that provide the most variation in the data set, PLS andPCR generally require fewer samples and simplify the problem ofgenerating a model.

PCR is similar to PLS regression in that the actual regression is doneon a relatively small number of latent variables obtained by coordinatetransformation of the raw independent variables (residue values). Thedifference between PLS and PCR is that the latent variables in PCR areconstructed by maximizing covariation between the independent variables(residue values). In PLS regression, the latent variables areconstructed in such a way as to maximize the covariation between theindependent variables and the dependent variables (activity values).Partial Least Squares regression is described in Hand, D. J., et al.(2001) Principles of Data Mining (Adaptive Computation and MachineLearning), Boston, Mass., MIT Press, and in in Geladi, et al. (1986)“Partial Least-Squares Regression: a Tutorial,” Anal. Chim. Acta,198:1-17. Both of these references are incorporated herein by referencefor all purposes.

In PCR and PLS, the direct result of the regression is an expression foractivity that is a function of the weighted latent variables. Thisexpression can be transformed to an expression for activity as afunction of the original independent variables by performing acoordinate transformation that converts the latent variables back to theoriginal independent variables.

In essence, both PCR and PLS first reduce the dimensionality of theinformation contained in the training set and then perform a regressionanalysis on a transformed data set; which has been transformed toproduce new independent variables, but preserves the original dependentvariable values. The transformed versions of the data sets may result inonly a relatively few expressions for performing the regressionanalysis. Compare this with a situation where no dimension reduction isperformed. In that situation, each separate residue for which there canbe a variation must be considered. This can be a very large set ofcoefficients; 2^(N) coefficients, where N is the number of residuepositions that may vary in the training set. In a typical principalcomponent analysis, only 3, 4, 5, 6 principal components are employed.

The ability of machine learning techniques to fit the training data isoften referred to as the “model fit” and in regression techniques suchas MLR, PCR and PLS is typically measured by the sum squared differencebetween measured and predicted values. For a given training set, theoptimal model fit will always be achieved using MLR, with PCR and PLSoften having a worse model fit (higher sum squared error betweenmeasurements and predictions). However, the chief advantage of usinglatent variable regression techniques such as PCR and PLS lies in thepredictive ability of such models. Obtaining a model fit with very smallsum squared error in no way guarantees the model will be able toaccurately predicted new samples not seen in the training set—in fact,it is often the opposite case, particularly when there are manyvariables and only a few observations (samples). Thus latent variableregression techniques (PCR, PLS) while often having worse model fits onthe training data are usually more robust and are able to predict newsamples outside the training set more accurately.

Another class of tools that can be used to generate models in accordancewith this invention is the support vector machines. These mathematicaltools take as inputs training sets of sequences that have beenclassified into two or more groups based on activity. Support vectormachines operate by weighting different members of a training setdifferently depending upon how close they are to a hyperplane interfaceseparating “active” and “inactive” members of the training set. Thistechnique requires that the scientist first decide which training setmembers to place in the active group and which training set members toplace in the inactive group. This may be accomplished by choosing anappropriate numerical value of activity to serve as the boundary betweenactive and inactive members of the training set. From thisclassification, the support vector machine will generate a vector, W,that can provide coefficient values for individual ones of theindependent variables defining the sequences of the active and inactivegroup members in the training set. These coefficients can be used to“rank” individual residues as described elsewhere herein. The techniqueattempts to identify a hyperplane that maximizes the distance betweenthe closest training set members on opposite sides of that plane. Inanother variation, support vector regression modeling is carried out. Inthis case, the dependent variable is a vector of continuous activityvalues. The support vector regression model will generate a coefficientvector, W, which can be used to rank individual residues.

SVMs have been used to look at large data sets in many studies and havebeen quite popular in the DNA microarray field. Their potentialstrengths include the ability to finely discriminate (by weighting)which factors separate samples from each other. To the extent that anSVM can tease out precisely which residues contribute to function, itcan be a particularly useful tool for ranking residues in accordancewith this invention. SVMs are described in S. Gunn (1998) “SupportVector Machines for Classification and Regressions,” Technical Report,Faculty of Engineering and Applied Science, Department of Electronicsand Computer Science, University of Southampton, which is incorporatedherein by reference for all purposes.

Another model generation technique of interest is genetic programming.This technique employs a Darwinian style evolution to discover theformulae and rules that characterize the data of a training set. It canbe used in regression problems of the types described herein. Theunderlying effect can be linear or non-linear. Genetic programming isdescribed in R. Goodacre et al. (2000) “Detection of the DipicolinicAcid Biomarker in Bacillus Spores Using Curie-Point Pyrolysis MassSpectrometry and Fourier Transform Infrared Spectroscopy,” Anal. Chem.,72, 119-127, which is incorporated herein by reference for all purposes.Examples of software tools for performing genetic programming includethe “GMAX” and the “GMAX-Bio” available from Aber Genomic Computing Ltdof Wales, UK.

i) Linear Model Examples

While the present invention is directed to non-linear models, these maybe more easily understood in the context of linear models of sequenceversus activity. Thus, the form and development of linear models willnow be described. In general, a linear regression model of activityversus sequence has the following form:

$\begin{matrix}{y = {c_{0} + {\sum\limits_{i = 1}^{N}{\sum\limits_{j = 1}^{M}{c_{ij}x_{ij}}}}}} & (1)\end{matrix}$

In this linear expression, y is predicted response, while c_(ij) andx_(ij) are the regression coefficient and bit value or dummy variableused to represent residue choice, respectively at position i in thesequence. There are N residue positions in the sequences of the proteinvariant library and each of these may be occupied by one or moreresidues. At any given position, there may be j=1 through M separateresidue types. This model assumes a linear (additive) relationshipbetween the residues at every position. An expanded version of equation1 follows:

y=c ₀ +c ₁₁ x ₁₁ +c ₁₂ x ₁₂ + . . . c _(1M) x _(1M) +c _(2l) x ₂₁ +c ₂₂x ₂₂ + . . . c _(2M) x _(2M) + . . . +c _(NM) x _(NM)

As indicated, data in the form of activity and sequence information isderived from the initial protein variant library and used to determinethe regression coefficients of the model. The dummy variables are firstidentified from an alignment of the protein variant sequences. Aminoacid residue positions are identified from among the protein variantsequences in which the amino acid residues in those positions differbetween sequences. Amino acid residue information in some or all ofthese variable residue positions may be incorporated in the sequenceactivity model.

Table I contains sequence information in the form of variable residuepositions and residue types for 10 illustrative variant proteins, alongwith activity values corresponding to each variant protein. Understandthat these are representative members of a larger set that is requiredto generate enough equations to solve for all the coefficients. Thus,for example, for the illustrative protein variant sequences in Table I,positions 10, 166, 175, and 340 are variable residue positions and allother positions, i.e., those not indicated in the Table, containresidues that are identical between Variants 1-10.

TABLE I Illustrative Sequence and Activity Data Variable Positions: 10166 175 340 y (activity) Variant 1 Ala Ser Gly Phe y₁ Variant 2 Asp PheVal Ala y₂ Variant 3 Lys Leu Gly Ala y₃ Variant 4 Asp Ile Val Phe y₄Variant 5 Ala Ile Val Ala y₅ Variant 6 Asp Ser Gly Phe y₆ Variant 7 LysPhe Gly Phe y₇ Variant 8 Ala Phe Val Ala y₈ Variant 9 Lys Ser Gly Phe y₉Variant 10 Asp Leu Val Ala y₁₀ and so on.

Thus, based on equation 1, a regression model can be derived from thesystematically varied library in Table I, i.e.:

y=c ₀ +c _(10Ala) x _(10Ala) +c _(10Asp) x _(10Asp) +c _(10Lys) x_(10Lys) +c _(166Ser) x _(166Ser) +c _(166Phe) x _(166Phe) +c _(166Leu)x _(166Leu) +c _(166Ile) x _(166Ile) +c _(175Gly) x _(175Gly) +c_(175Val) x _(175Val) +c _(340Phe) x _(340Phe) +c _(340Ala) x_(340Ala)  (2)

The bit values (x dummy variables) can be represented as either 1 or 0reflecting the presence or absence of the designated amino acid residueor alternatively, 1 or −1, or some other surrogate representation. Forexample, using the 1 or 0 designation, x_(10Ala) would be “1” forVariant 1 and “0” for Variant 2. Using the 1 or −1 designation,x_(10Ala) would be “1” for Variant 1 and “−1” for Variant 2. Theregression coefficients can thus be derived from regression equationsbased on the sequence activity information for all variants in library.Examples of such equations for Variants 1-10 (using the 1 or 0designation for x) follow:

y ₁ =c ₀ +c _(10Ala)(1)+c _(10Asp)(0)+c _(10Lys)(0)+c _(166Ser)(1)+c_(166Phe)(0)+c _(166Leu)(0)+c _(166Ile)(0)+c _(175Gly)(1)+c_(175Val)(0)+c _(340Phe)(1)+c _(340Ala)(0)

y ₂ =c ₀ +c _(10Ala)(0)+c _(10Asp)(1)+c _(10Lys)(0)+c _(166Ser)(0)+c_(166Phe)(1)+c _(166Leu)(0)+c _(166Ile)(0)+c _(175Gly)(0)+c_(175Val)(1)+c _(340Phe)(0)+c _(340Ala)(1)

y ₃ =c ₀ +c _(10Ala)(0)+c _(10Asp)(0)+c _(10Lys)(1)+c _(166Ser)(0)+c_(166Phe)(0)+c _(166Leu)(1)+c _(166Ile)(0)+c _(175Gly)(1)+c_(175Val)(0)+c _(340Phe)(0)+c _(340Ala)(1)

y ₄ =c ₀ +c _(10Ala)(0)+c _(10Asp)(1)+c _(10Lys)(0)+c _(166Ser)(0)+c_(166Phe)(0)+c _(166Leu)(0)+c _(166Ile)(1)+c _(175Gly)(0)+c_(175Val)(1)+c _(340Phe)(1)+c _(340Ala)(0)

y ₅ =c ₀ +c _(10Ala)(1)+c _(10Asp)(0)+c _(10Lys)(0)+c _(166Ser)(0)+c_(166Phe)(0)+c _(166Leu)(0)+c _(166Ile)(1)+c _(175Gly)(0)+c_(175Val)(1)+c _(340Phe)(0)+c _(340Ala)(1)

y ₆ =c ₀ +c _(10Ala)(0)+c _(10Asp)(1)+c _(10Lys)(0)+c _(166Ser)(1)+c_(166Phe)(0)+c _(166Leu)(0)+c _(166Ile)(0)+c _(175Gly)(1)+c_(175Val)(0)+c _(340Phe)(1)+c _(340Ala)(0)

y ₇ =c ₀ +c _(10Ala)(0)+c _(10Asp)(0)+c _(10Lys)(1)+c _(166Ser)(0)+c_(166Phe)(1)+c _(166Leu)(0)+c _(166Ile)(0)+c _(175Gly)(1)+c_(175Val)(0)+c _(340Phe)(1)+c _(340Ala)(0)

y ₈ =c ₀ +c _(10Ala)(1)+c _(10Asp)(0)+c _(10Lys)(0)+c _(166Ser)(0)+c_(166Phe)(1)+c _(166Leu)(0)+c _(166Ile)(0)+c _(175Gly)(0)+c_(175Val)(1)+c _(340Phe)(0)+c _(340Ala)(1)

y ₉ =c ₀ +c _(10Ala)(0)+c _(10Asp)(0)+c _(10Lys)(1)+c _(166Ser)(1)+c_(166Phe)(0)+c _(166Leu)(0)+c _(166Ile)(0)+c _(175Gly)(1)+c_(175Val)(0)+c _(340Phe)(1)+c _(340Ala)(0)

y ₁₀ =c ₀ +c _(10Ala)(0)+c _(10Asp)(1)+c _(10Lys)(0)+c _(166Ser)(0)+c_(166Phe)(0)+c _(166Leu)(1)+c _(166Ile)(0)+c _(175Gly)(0)+c_(175Val)(1)+c _(340Phe)(0)+c _(340Ala)(1)

The complete set of equations can be readily solved using a regressiontechnique (e.g., PCR, PLS, or MLR) to determine the value for regressioncoefficients corresponding to each residue and position of interest. Inthis example, the relative magnitude of the regression coefficientcorrelates to the relative magnitude of contribution of that particularresidue at the particular position to activity. The regressioncoefficients may then be ranked or otherwise categorized to determinewhich residues are more likely to favorably contribute to the desiredactivity. Table II provides illustrative regression coefficient valuescorresponding to the systematically varied library exemplified in TableI:

TABLE II Illustrative Rank Ordering of Regression CoefficientsREGRESSION COEFFICIENT VALUE c_(166Ile) 62.15s c_(175Gly) 61.89c_(10Asp) 60.23 c_(340 Ala) 57.45 c_(10 Ala) 50.12 c_(166 Phe) 49.65c_(166Leu) 49.42 c_(340 Phe) 47.16 c_(166Ser) 45.34 c_(175 Val) 43.65c_(10 Lys) 40.15

The rank ordered list of regression coefficients can be used toconstruct a new library of protein variants that is optimized withrespect to a desired activity (i.e., improved fitness). This can be donein various ways. In one case, it is accomplished by retaining the aminoacid residues having coefficients with the highest observed values.These are the residues indicated by the regression model to contributethe most to desired activity. If negative descriptors are employed toidentify residues (e.g., 1 for leucine and −1 for glycine), it becomesnecessary to rank residue positions based on absolute value of thecoefficient. Note that in such situations, there is typically only asingle coefficient for each residue. The absolute value of thecoefficient magnitude gives the ranking of the corresponding residueposition. Then, it becomes necessary to consider the signs of theindividual residues to determine whether each of them is detrimental orbeneficial in terms of the desired activity.

ii) Non-Linear Models

Non-linear modeling is employed to account for residue-residueinteractions that contribute to activity in proteins. An N-K landscapedescribes this problem. The parameter N refers to the number of variableresidues in a collection of related polypeptides sequences. Theparameter K represents the interaction between individual residueswithin anyone of these polypeptides. Interaction is usually a result ofclose physical proximity between various residues whether in theprimary, secondary, or tertiary structure of the polypeptide. Theinteraction may be due to direct interactions, indirect interactions,physicochemical interactions, interactions due to folding intermediates,translational effects, and the like.

The parameter K is defined such that for value K=1, each variableresidue (e.g., there are 20 of them) interacts with exactly one otherresidue in its sequence. In the case where all residues are physicallyand chemically separate from the effects of all other residues, thevalue of K is zero. Obviously, depending upon the structure of thepolypeptide, K can have a wide range of different values. With arigorously solved structure of the polypeptide in question, a value forK may be estimated. Often, however, this is not the case.

A purely linear, additive model of polypeptide activity (as describedabove) can be improved by including one or more non-linear interactionterms representing specific interactions between 2 or more residues. Inthe context of the model form presented above, these terms are depictedas “cross-products” containing two or more dummy variables representingthe two or more particular residues (each associated with a particularposition in the sequence) that interact to have a significant positiveor negative impact on activity. For example, a cross-product term mayhave the form c_(ab)x_(a)x_(b), where x_(a) is a dummy variablerepresenting the presence of a particular residue at a particularposition on the sequence and the variable x_(b) represents the presenceof a particular residue at a different position (that interacts with thefirst position) in the polypeptide sequence. A detailed example form ofthe model is shown below.

The presence of all residues represented in the cross-product term (eachof two or more specific types of residue at specifically identifiedpositions) impacts the overall activity of the polypeptide. The impactcan be manifest in many different ways. For example, each of theindividual interacting residues when present alone in a polypeptide mayhave a negative impact on activity, but when each of them is presenttogether in the polypeptide, the overall effect is positive. Theopposite may be true in other cases. In addition, there may be asynergistic effect produced in which each of the individual residuesalone has a relatively limited impact on activity, but when all of themare present the effect on activity is greater than the cumulativeeffects of all the individual residues.

It is possible that a non-linear model could include a cross-productterm for every possible combination of interacting variable residues inthe sequence. However, this would not represent physical reality, asonly a subset of the variable residues actually interact with oneanother. In addition, it would result in “overfitting” to produce amodel that provides spurious results that are manifestations of theparticular polypeptides used to create the model and do not representreal interactions within the polypeptide. The correct number ofcross-product terms for a model that represents physical reality, andavoids overfitting, is dictated by the value of K. For example, if K=1,the number of cross-product interaction terms equals N.

Note that in general, it may be more preferable to have too fewcross-product terms than too many. If the relatively few cross-productterms included in the non-linear model are actually those having thebiggest influence on activity, than it is definitely preferable to havetoo few. As should be apparent, in constructing a non-linear model, itis important to identify those cross-product interaction termsrepresenting true structural interactions that have a significant impacton activity. This can be accomplished in various ways. These includeforward addition, where candidate cross-product terms (starting with theterm with largest regression coefficient and progressing to terms withsmaller regression coefficients) are added to the initial linear onlymodel one at a time until the addition of terms is no longerstatistically significant (as measured by an F-test or some otherappropriate statistical test); reverse elimination, where all possiblecross product terms are added at the beginning and removed one at a time(starting with the term with smallest regression coefficient andprogressing to terms with larger regression coefficients) until removalthe least important remaining term is statistically significant. Oneexample presented below involves the use of a genetic algorithm toidentify the useful non-linear terms.

Generally, the approach to generating a non-linear model containing suchinteraction terms is the same as the approach described above forgenerating a linear model. In other words, a training set is employed to“fit” the data to a model. However, one or more non-linear terms,preferably the cross-product terms discussed above, are added to theform of the model. Further, the resulting non-linear model, like thelinear models described above, can be employed to rank the importance ofvarious residues on the overall activity of a polypeptide. Varioustechniques can identify the best combination of variable residues aspredicted by the non-linear equation. Unfortunately, unlike the linearcase, it is often impossible to accomplish this by simple inspection ofthe additive model. Approaches to ranking the residues are describedbelow.

Note that there are a very large number of possible cross-product termsfor variable residues, even when limited to interactions caused by onlytwo residues. As more interactions occur, the number of potentialinteractions to consider for a non-linear model grows in an exponentialmanner. If the model includes the possibility of interactions thatinclude three or more residues, the number of potential terms grows evenmore rapidly.

In a simple case where there are 20 variable residues and K=1 (assumesthat each variable residue interacts with one other variable residue),there should be 20 interaction terms (cross-products) in the model. Ifthere are any fewer, the model will not fully describe the interactions(although some of the interactions may not have a significant impact onactivity), and any more and the model may overfit the data set. Thereare N*(N−1)/2 or 190 possible pairs of interactions. Finding thecombination of 20 unique pairs that describe the 20 interactions in thesequence is a significant computational problem. There are approximately5.48×10²⁶ possible combinations.

Numerous techniques can be employed to identify the relevantcross-product terms. Depending upon the size of the problem and thecomputational power available, one might be able to explore all possiblecombinations and thereby identify the one model that best fits the data(the numbers of the training set). However, often the problem will betoo large for the available computational resources and so one mustresort to an efficient search algorithm or an approximation. Asmentioned, one suitable search technique is a genetic algorithm.

In a genetic algorithm, an appropriate fitness function and anappropriate mating procedure are defined. The fitness function providesa criterion for determining which models (combinations of cross-productterms) are “most fit” (i.e., likely to provide the best results). Themating procedure provides a mechanism for introducing new combinationsof cross-product terms from successful “parental” models in a previousgeneration. One example of a genetic algorithm for identifying acombination of cross-product terms will now be described with referenceto FIG. 2. This algorithm begins with a first generation comprisingmultiple possible models, some of which do a better job of representingphysical reality than others. See block 201. The first and eachsuccessive generation is represented as a number of models in a“population”. Each “model” is a combination of linear terms (fixedacross all models) and non-linear cross-product terms. The “model” inthis genetic algorithm does not intrinsically include coefficients forthe individual linear and non-linear terms, only an identification of acombination of non-linear terms (e.g., the cross-product terms). Thegenetic algorithm proceeds towards convergence by marching throughsuccessive generations of models, each characterized by a differentcombination of non-linear interaction terms.

Each model in a generation is used to fit a training set of polypeptides(having known sequences and associated activities). The training set isused to fit the individual models of the current generation. See block203, 205, 207, and 209 of FIG. 2. In one example, a partial leastsquares technique or similar regression technique is used to perform thefit.

The predictive power of the resulting model (including coefficientsobtained by the regression on the training set data) is used as afitting function. To provide a detailed assessment of the predictivepower, many different fits of a model may be provided for a giventraining set. See blocks 205, 207, and 209. Each fit provides its ownunique set of coefficient values for the linear and non-linear terms ofthe model under consideration. In one approach, a “leave one out”approach is employed in which all but one member of the training set isused to fit the model. This left out member is then used to test thepredictive power of the resulting instance of the model. The modelinstance (model terms together coefficient values identified by fitting)is expected to do a good job of predicting the activities of thetraining set members employed to produce it. However, it may not do sowell at predicting the activity of a polypeptide from outside theutilized members of the training set. In a specific embodiment, multiple“leave one out” model instances are generated and each is assessed forits ability to predict the activity of the left out member. Theresulting set of predictions is combined to get an aggregate measure ofthe predictive capability (see block 211). In one example, the aggregatemeasure is a predicted residual sum of squares (PRESS) for the variousleave one out model instances of the current model. The PRESS is, ineffect, the fitting function of the genetic algorithm.

After each combination of non-linear cross-product terms (model) in aparticular generation is evaluated for its predictive power (i.e.,decision 213 is answered in the negative), the genetic algorithm ischecked for convergence. See block 215. Assuming that the geneticalgorithm has not yet converged, the models of the current generationare ranked. Those that do the best job of predicting activity may bepreserved and used in the next generation. See block 217. For example,an elitism rate of 10% may be employed. In other words, the top 10% ofmodels (as determined using the fitting function and measured by, e.g.,PRESS scores) are set aside to become members of the next generation.The remaining 90% of the members in the next generation are obtained bymating “parents” from the previous generation. See blocks 219, 221, and223.

The “parents” are models selected randomly from the previous generation.See block 219. However, the random selection is typically weightedtoward more fit numbers of the previous generation. For example, theparent models may be selected using a linear weighting (e.g., a modelthat performs 1.2 times better than another model is 20% more likely tobe selected) or a geometric weighting (i.e., the predictive differencesin models are raised to a power in order to obtain a probability ofselection).

After a set of parent models has been selected, pairs of such models aremated (block 221) to produce children models by providing somenon-linear terms from one parent and other non-linear terms form theother parent. In one approach, the non-linear terms (cross-products) ofthe two parents are aligned and each term is considered in succession todetermine whether the child should take the term from parent A or fromparent B. In one implementation, the mating process begins with parent Aand randomly determines whether a “cross over” event should occur at thefirst non-linear term encountered. If so, the term is taken from parentB. If not, the term is taken from parent A. The next term in successionis considered for cross over, etc. The terms continue to come from theparent donating the previous term under consideration until a cross overevent occurs. At that point, the next term is donated from the otherparent and all successive terms are donated from that parent untilanother cross over event occurs. To ensure that the same non-linearcross-product terms is not selected at two different locations in thechild model, various techniques may be employed, e.g., a partiallymatched cross over technique.

After each non-linear term has been considered, a child “model” isdefined for the next generation. Then another two parents are chosen toproduce another child model, and so on. Eventually, after a completegeneration has been selected in this manner (block 223), the nextgeneration is ready for evaluation, and process control then returns toblock 203, where the members of the next generation are evaluated asdescribed above.

The process continues generation-by-generation until convergence, (i.e.,decision block 215 is answered in the negative. At that point, the topranked model is selected from the current generation as the overall bestmodel. See block 225. Convergence can be tested by many conventionaltechniques. Generally, it involves determining that the performance ofthe best model from a number of successive generations does not changeappreciably.

At this point, an example will be presented to show the value ofincorporating non-linear cross-product terms in a model predictingactivity from sequence. Consider the following non-linear model in whichit is assumed there are only two residue options at each variableposition in the sequence. In this example, the protein sequence is castinto a coded sequence by using dummy variables that correspond to choiceA or choice B, using +1 and −1 respectively. The model is immune to thearbitrary choice of which numerical value used to assign each residuechoice.

TABLE III Variable Position 1 2 3 4 5 6 7 8 9 10 Choice A I L L M G W KC S F Choice B V A I P H N R T A Y Protein V A L P G W K T S F SequenceModel −1 −1 +1 −1 +1 +1 +1 −1 +1 +1 Sequence

With this coding scheme, the linear model used to associate proteinsequences with activity can be written as follows:

y=c ₁ x ₁ +c ₂ x ₂ +c ₃ x ₃ + . . . +c _(n) x _(n) +c _(N) x _(N) +c₀  (3)

where y is the response (activity), c_(n) the regression coefficient forthe residue choice at position n, x the dummy variable coding for theresidue choice (+1/−1) at position n, and c₀ the mean value of theresponse. This form of the model assumes there are no interactionsbetween the variable residues—each residue choice contributesindependently to the overall fitness of the protein.

The non-linear model includes a certain number of (as yet undetermined)cross-product terms to account for interactions between residues:

y= ₁ x ₁ +c ₂ x ₂ +c ₃ x ₃ + . . . +c _(n) x _(n) +c _(1,2) x ₁ x ₃ +c_(1,3) x ₁ x ₃ +c _(2,3) x ₂ x ₃ +c ₀  (4)

where the variables are the same as those in Eq. (3) but now there arenon-linear terms, e.g., c_(1,2) is the regression coefficient for theinteraction between variable positions 1 and 2.

In order to assess the performance of the linear and non-linear models asynthetic data source known as the NK landscape (Kauffman, 1993) wasused. As mentioned, N is the number of variable positions in a simulatedprotein and K is the epistatic coupling between residues. The syntheticdata set was generated only in silico.

This data set was used to generate an initial training set with S=40synthetic samples, N=20 variable positions and K=1 (to reiterate, forK=1 each variable position is functionally coupled to one other variableposition). In generating the randomized proteins, each variable positionhad an equal probability of containing the dummy variable +1 or −1. Theresidue-residue interactions (represented by cross-products) and actualactivities are known for each member of the synthetic training set.Another V=100 samples were generated for use in validation. Again, theresidue-residue interactions and activities are known for each member ofthe validation set.

The training sets were used to construct both linear and non-linearmodels using the methods described above. Some non-linear models weregenerated with selection of the cross-product terms (using a geneticalgorithm as described above) and other non-linear models were generatedwithout selection of such terms. For the training set size of S=40, thelinear model was capable of correlating the measured and predictedvalues reasonably well, but demonstrated weaker correlation whenvalidated against data not seen in the training set (see FIG. 3A). Asshown, the dark data points represent the cross-validated predictionsmade by a linear model based on the other 39 data points in the trainingset and predicting the single held out data point. Thus there areactually 40 slightly different models represented by the dark datapoints. The light data points represent the predictions made by a singlemodel constructed from the 40 training samples and used to predict thevalidation samples V, none of which were seen in the original trainingset. The use of the validation set then gives a good measure of the truepredictive capacity of the model, as opposed to the cross validatedtraining set, which can suffer from the model overfit problem especiallyfor the non-linear cases described below.

This result for S=40 is interesting considering the linear model wasused to model a non-linear fitness landscape. In this case, the linearmodel could, at best, capture the average contribution to fitness forthe choice of a given residue. With enough of these averagecontributions taken together, the linear model could roughly predict themeasured response. The validation results for the linear model weremarginally better when the training size was increased to S=100 (seeFIG. 3B). Then tendency of relatively simple models to underfit data isknown as bias.

When the non-linear model was trained using only S=40 samples (and 20non-linear cross-product terms were selected using a genetic algorithmas described above), the correlation with training set members wasexcellent (see FIG. 3C). Unfortunately, the model contained limitedpredictive power outside the training set, as evidenced by its limitedcorrelation with measured values in the validation set. This non-linearmodel, with many potential variables (210 possible), and limitedtraining data to facilitate identification of the proper cross-productterms, was able to essentially just memorize the data set it was trainedon. This tendency of high complexity models to overfit the data is knownas the variance. The bias-variance tradeoff represents a fundamentalproblem in machine learning and some form of validation is almost alwaysrequired to address it when dealing with new or uncharacterized machinelearning problems. Satisfyingly, for the larger training set (S=100) thenon-linear model performed exceedingly well for both the trainingprediction and, more importantly, the validation prediction (see FIG.3D). The validation predictions were so good that most of the datapoints are obscured by the dark circles use to plot the training set.

For comparison, FIGS. 3E and 3F show the performance of non-linearmodels prepared without careful selection of the cross-product terms.Unlike the models in FIGS. 3C and 3D, every possible cross-product termwas chosen (i.e., 190 cross-product terms for N=20). As can be seen, theability to predict validation set activity is relatively poor comparedthat of the non-linear models generated with selection of cross-productterms. This is a manifestation of overfitting.

iii) Generating an Optimized Protein Variant Library by ModifyingModel-Predicted Sequences

Rather than simply synthesizing the single best-predicted protein onemay generate a combinatorial library of proteins based on a sensitivityanalysis of the best protein to changes in the residue choices at eachlocation. The more sensitive a given residue choice is for the predictedprotein, the greater the predicted fitness change will be. One can rankthese sensitivities from highest to lowest and use the sensitivityscores to create combinatorial protein libraries in subsequent rounds byincorporating those residues based on sensitivity. For a linear modelthe sensitivity may be identified by simply considering the size of thecoefficient associated with a given residue term in the model. For thenon-linear model, this will not be possible. Instead the residuesensitivity may be determined by using the model to calculate changes inactivity when a single residue is varied in the “best” predictedsequence.

Residues may be considered in the order in which they are ranked. Foreach residue under consideration, the process determines whether to“toggle” that residue. The term “toggling” refers to the introduction ofmultiple amino acid residue types into a specific position in thesequences of protein variants in the optimized library. For example,serine may appear in position 166 in one protein variant, whereasphenylalanine may appear in position 166 in another protein variant inthe same library. Amino acid residues that did not vary between proteinvariant sequences in the training set typically remain fixed in theoptimized library.

An optimized protein variant library can be designed such that all ofthe identified “high” ranking regression coefficient residues are fixed,and the remaining lower ranking regression coefficient residues aretoggled. The rationale for this being that one should search the localspace surrounding the ‘best’ predicted protein. Note that the startingpoint “backbone” in which the toggles are introduced may be the bestprotein predicted by a model or an already validated ‘best’ protein froma screened library.

In an alternative approach, at least one or more, but not all of thehigh-ranking regression coefficient residues identified may be fixed inthe optimized library, and the others toggled. This approach isrecommended if it is desired not to drastically change the context ofthe other amino acid residues by incorporating too many changes at onetime. Again, the starting point for toggling may be the best set ofresidues as predicted by the model or a best validated protein from anexisting library. Or the starting point may be an “average” clone thatmodels well. In this case, it may be desirable to toggle the residuespredicted to be of higher importance. The rationale for this being thatone should explore a larger space in search for activity hillspreviously omitted from the sampling. This type of library is typicallymore relevant in early rounds as it generates a more refined picture forsubsequent rounds.

Alternatives to the above methodology involve different procedures forusing residue importance (rankings) in determining which residues totoggle. In one such alternative, higher ranked residue positions aremore aggressively favored for toggling. The information needed in thisapproach includes the sequence of a best protein from the training set,a PLS or PCR predicted best sequence, and a ranking of residues from thePLS or PCR model. The “best” protein is a wet-lab validated “best” clonein the dataset (clone with the highest measured function that stillmodels well, i.e., falls relatively close to the predicted value incross validation). The method compares each residue from this proteinwith the corresponding residue from a “best predicted” sequence havingthe highest value of the desired activity. If the residue with thehighest load or regression coefficient is not present in the ‘best’clone, the method introduces that position as a toggle position for thesubsequent library. If the residue is present in the best clone, themethod will not treat the position as a toggle position, and it willmove the next position in succession. The process is repeated forvarious residues, moving through successively lower load values, untilthe library is of sufficient size is generated.

The number of regression coefficient residues to retain, and number ofregression coefficient residues to toggle, can be varied. Factors toconsider include the desired library size, the magnitude of differencebetween regression coefficients, and the degree to which nonlinearity isthought to exist—retaining residues with small (neutral) coefficientsmay uncover important nonlinearities in subsequent rounds of evolution.Typical optimized protein variant libraries of the present inventioncontain about 2^(N) protein variants, where N represents the number ofpositions that are toggled between two residues. Stated another way, thediversity added by each additional toggle doubles the size of thelibrary such that 10 toggle positions produces ˜1,000 clones (1,024), 13positions ˜10,000 clones (8,192) and 20 positions ˜1,000,000 clones(1,048,576). The appropriate size of library depends on factors such ascost of screen, ruggedness of landscape, preferred percentage samplingof space etc. In some cases, it has been found that a relatively largenumber of changed residues produces a library in which an inordinatelylarge percentage of the clones are non-functional. Therefore for someapplications, it may be recommended that the number of residues fortoggling ranges from about 2 to about 30; i.e., the library size rangesfrom between about 4 and 2³⁰˜10⁹ clones.

In practice, one can pursue various subsequent round library strategiesat the same time, with some strategies being more aggressive (fixingmore “beneficial” residues) and other strategies being more conservative(fixing fewer “beneficial” residues in the hopes of exploring the spacemore thoroughly).

It may be desirable to identify and preserve groups or residues or“motifs” that occur in most naturally occurring or otherwise successfulpeptides. For example, it may be found that Ile at variable position 3is always coupled with Val at variable position 11 in naturallyoccurring peptides. It has been found that such residue groups can beimportant to preserving activity in the peptide. Hence, in oneembodiment, preservation of such groups is required in any togglingstrategy. In other words, the only accepted toggles are those thatpreserve a particular grouping in the base protein or those thatgenerate a different grouping that is also found in active proteins. Inthe latter case it will be necessary to toggle two or more residues.

In varied approaches, a wet-lab validated ‘best’ (or one of the fewbest) protein in the current optimized library (i.e., a protein with thehighest, or one of the few highest, measured function that still modelswell, i.e., falls relatively close to the predicted value in crossvalidation) may serve as a backbone where various schemes of changes areincorporated. In another approach, a wet-lab validated ‘best’ (or one ofthe few best) protein in the current library that may not model well mayserve as a backbone where various schemes of changes are incorporated.In other approaches, a sequence predicted by the sequence activity modelto have the highest value (or one of the highest values) of the desiredactivity may serve as the backbone. In these approaches, the dataset forthe “next generation” library (and possibly a corresponding model) isobtained by changing residues in one or a few of the best proteins. Inone embodiment, these changes comprise a systematic variation of theresidues in the backbone. In some cases, the changes comprise variousmutagenesis, recombination and/or subsequence selection techniques. Eachof these may be performed in vitro, in vivo, or in silico.

While the optimal sequence is predicted by a linear model can beidentified by inspection as described above, the same is not true of anon-linear model. Certain residues appear in both linear and crossproduct terms and their overall affect on activity in the context ofmany possible combinations of other residues can present a dauntingproblem.

As with selection of cross product terms for a non-linear model, theoptimal sequence predicted by a non-linear model can be identified bytesting all possible sequences with the model (assuming sufficientcomputational resources) or a searching algorithm such as a geneticalgorithm. One exemplary genetic algorithm will now be described.

In this algorithm, the fitness function is simply the non-linear model'sprediction of activity. In a specific example, an elitism rate of about5-10% is employed. Selection of parents for mating involves a linearweighted fitness operation. The selected parents provide ordered sets ofresidues and a uniform cross over operation is employed. The bestcomputer generated protein is picked after no improvement is seen for atleast 15 generations.

The information contained in the computer-evolved proteins identified asdescribed above can be used to synthesize novel proteins in the lab andtest them on physical assays. An accurate in silico representation ofthe real wet lab fitness function, allows researchers to reduce thenumber of cycles of evolution or the number variants needed to screen inthe lab. Optimized protein variant libraries can be generated using therecombination methods described herein, or alternatively, by genesynthesis methods, followed by in vivo or in vitro expression. After,the optimized protein variant libraries are screened for desiredactivity, they may be sequenced. As indicated above in the discussion ofFIG. 1, the activity and sequence information from the optimized proteinvariant library can be employed to generate another sequence activitymodel from which a further optimized library can be designed, using themethods described herein. In one approach, all proteins from this newlibrary are used as part of the dataset.

iv) Alternative Modelling Options

Multiple other variations on the above approach are within the scope ofthis invention. As one example, the x_(ij) variables are representationsof the physical or chemical properties of amino acids—rather than theexact identities of the amino acids themselves (leucine versus valineversus proline, . . . ). Examples of such properties includelipophilicity, bulk, and electronic properties (e.g., formal charge, vander Waals surface area associated a partial charge, etc.). To implementthis approach, the x_(ij) values representing amino acid residues can bepresented in terms of their properties or principal componentsconstructed from the properties.

In another variation, the x_(ij) variables represent nucleotides, ratherthan amino acid residues. The goal is to identify nucleic acid sequencesthat encode proteins for a protein variant library. By using nucleotidesrather than amino acids, one can optimize on parameters other thanmerely specific activity. For example, protein expression in aparticular host or vector may be a function of nucleotide sequence. Twodifferent nucleotide sequences may encode a protein having one aminoacid sequence, but one of the nucleotide sequences expresses greaterquantities of protein and/or expresses the protein in a more activestate. By using nucleotide sequences rather than amino acid sequences,the methods of this invention can optimize for expression properties,for example, as well as specific activity.

In a specific embodiment, the nucleotide sequence is represented ascodons. Models may employ codons as the atomic unit of a nucleotidesequence such that the predicted activities are a function of variouscodons in the nucleotide sequence. Each codon together with its positionin the overall nucleotide sequence serves as an independent variable forgenerating sequence activity models. Note that different codons forgiven amino acid express differently in a given organism. Morespecifically, each organism has a preferred codon, or distribution ofcodon frequencies, for a given amino acid. By using codons as theindependent variables, the invention accounts for these preferences.Thus, the invention can be used to generate a library of expressionvariants (e.g., where “activity” includes expression level from aparticular host organism).

An outline of a particular method includes the following operations: (a)receiving data characterizing a training set of a protein variantlibrary; (b) from the data, developing a non-linear sequence activitymodel that predicts activity as a function of nucleotide types andcorresponding position in the nucleotide sequence; (c) using thesequence activity model to rank positions in a nucleotide sequenceand/or nucleotide types at specific positions in the nucleotide sequencein order of impact on the desired activity; and (d) using the ranking toidentify one or more nucleotides, in the nucleotide sequence, that areto be varied or fixed in order to impact the desired activity. Asindicated, the nucleotides to be varied are preferably codons encodingparticular amino acids.

Other variations of the above approach involve use of differenttechniques for ranking residues or otherwise characterizing them interms of importance. With linear models, the magnitudes of regressioncoefficients were used to rank residues. Residues having coefficientswith large magnitudes (e.g., 166 Ile) were viewed as high-rankingresidues. This characterization was used to decide whether or not tovary a particular residue in the generation of a new, optimized libraryof protein variants. For non-linear models, the sensitivity analysis wasmore complex.

PLS and other techniques provide other information, beyond regressioncoefficient magnitude, that can be used to rank specific residues orresidue positions. Techniques such as PLS and Principle ComponentAnalysis (PCA) or PCR provide information in the form of principlecomponents or latent vectors. These represent directions or vectors ofmaximum variation through multi-dimensional data sets such as theprotein sequence-activity space employed in this invention. These latentvectors are functions of the various sequence dimensions; i.e., theindividual residues or residue positions that comprise the proteinsequences of the variant library used to construct the training set. Alatent vector will therefore comprise a sum of contributions from eachof the residue positions in the training set. Some positions willcontribute more strongly to the direction of the vector. These will bemanifest by relatively large “loads,” i.e., the coefficients used todescribe the vector. As a simple example, a training set may becomprised of tripeptides. The first latent vector will typically havecontributions from all three residues.

Vector 1=a1(residue position 1)+a2(residue position 2)+a3(residueposition 3)

The coefficients, a1, a2, and a3, are the loads. Because these reflectthe importance of the corresponding residue positions to variation inthe dataset, they can be used to rank the importance of individualresidue positions for purposes of “toggling” decisions, as describedabove. Loads, like regression coefficients, may be used to rank residuesat each toggled position. Various parameters describe the importance ofthese loads. Some such Variable Importance in Projection (VIP) make useof a load matrix, which is comprised of the loads for multiple latentvectors taken from a training set. In Variable Importance for PLSProjection, the importance of the ith variable (e.g., residue position)is computed by calculating VIP (variable importance in projection). Fora given PLS dimension, a, (VIN)_(ak) ² is equal to the squared PLSweight (w_(ak))² of a variable multiplied by the percent explainedvariability in y (dependent variable, e.g., certain function) by thatPLS dimension. (VIN)_(ak) ² is summed over all PLS dimensions(components). VIP is then calculated by dividing the sum by the totalpercent variability in y explained by the PLS model and multiplying bythe number of variables in the model. Variables with large VIP, largerthan 1, are the most relevant for correlating with a certain function(y) and hence highest ranked for purposes of making toggling decisions.

Another embodiment of the invention employs techniques that rankresidues not simply by the magnitudes of their predicted contributionsto activity, but by the confidence in those predicted contributions aswell. In some cases the researcher will be concerned with spuriousvalues of the coefficients or principal components.

In a more statistically rigorous approach, the ranking is based on acombination of magnitude and distribution. Coefficients with both highmagnitudes and tight distributions give the highest ranking. In somecases, one coefficient with a lower magnitude than another may be givena higher ranking by virtue of having less variation. Thus, someembodiments of the invention rank residues or nucleotides based on bothmagnitude and standard deviation or variance. Various techniques can beused to accomplish this. One of these, a bootstrap p-value approach,will now be described.

An example of a method that employs a bootstrap method is depicted inFIG. 4. As shown there, a method 125 begins at a block 127 where anoriginal data set S is provided. This may be a training set as describedabove. For example, it may be generated by systematically varying theindividual residues of a starting sequence in any one of the mannersdescribed above. In the example of method 125, the data set S has Mdifferent data points (activity and sequence information collected fromamino acid or nucleotide sequences) for use in the analysis.

From data set S, various bootstrap sets B are created. Each of these isobtained by sampling, with replacement, from set S to create a new setof M members—all taken from original set S. See block 129. The “withreplacement” condition produces variations on the original set S. Thenew bootstrap set, B, will sometimes contain replicate samples from S.And, it may also lack certain samples originally contained in S.

As an example, consider a set S of 100 sequences. Each bootstrap set Bused in the method contains itself 100 sequences. A bootstrap set B iscreated by randomly selecting each of the 100 member sequences from the100 sequences in the original set S. Thus, it is possible that somesequences will be selected more than once and others will not beselected at all.

Using the bootstrap set B currently under consideration, the method nextbuilds a model. See block 131. The model may be built as describedabove, using PLS, PCR, a SVM, genetic programming, etc. This model willprovide coefficients or other indicia of ranking for the residues ornucleotides found in the various samples from set B. As shown at a block133, these coefficients or other indicia are recorded for subsequentuse.

Next, at a decision block 135, the method determines whether anotherbootstrap set should be created. If yes, the method returns to block 129where a new bootstrap set B is created as described above. If no, themethod proceeds to a block 137 discussed below. The decision at block135 turns on how many different sets of coefficient values are to beused in assessing the distributions of those values. The number of setsB should be sufficient to generate accurate statistics. As an example,100 to 1000 bootstrap sets are prepared and analyzed. This isrepresented as about 100 to 1000 passes through blocks 129, 131, and 133of method 125.

After a sufficient number bootstrap sets B have been prepared andanalyzed as described, decision 135 is answered in the negative. Asindicated, the method then proceeds to block 137. There a mean andstandard deviation of a coefficient (or other indicator generated by themodel) is calculated for each residue or nucleotide (including codons)using the coefficient values (e.g., 100 to 1000 of them, one from eachbootstrap set). From this information, the method can calculate thet-statistic and determine the confidence interval that the measuredvalue is different from zero. From the t-statistic it calculates thep-value for the confidence interval. In this case, the smaller p-valuethe more confidence that the measured regression coefficient isdifferent from zero.

Note that the p-value is but one of many different types ofcharacterization that can account for the statistical variation in acoefficient or other indicator of residue importance. Examples includecalculating 95 percent confidence intervals for regression coefficientsand excluding any regression coefficient for consideration for which 95percent confidence interval crosses zero line. Basically, anycharacterization that accounts for standard deviation, variance, orother statistically relevant measure of data distribution can be used.Such characterization preferably also accounts for the magnitude of thecoefficients.

A large standard deviation can result from various sources. One sourceis poor measurements in the data set. Another is a limitedrepresentation of a particular residue or nucleotide in the originaldata set. In this latter case, some bootstrap sets will contain nooccurrences of a particular residue or nucleotide. In such cases, thevalue of the coefficient for that residue will be zero. Other bootstrapsets will contain at least some occurrences of the residue or nucleotideand give a non-zero value of the corresponding coefficient. But the setsgiving a zero value will cause the standard deviation of the coefficientto become relatively large. This reduces the confidence in thecoefficient value and results in a lower rank. But this is to beexpected, given that there is relatively little data on the residue ornucleotide in question.

Next, at a block 139, the method ranks the regression coefficients (orother indicators) from lower (best) p-value to highest (worst) p-value.This ranking correlates highly with the absolute value of the regressioncoefficients themselves, owing to the fact that the larger the absolutevalue, the more standard deviations removed from zero. Thus, for a givenstandard deviation, the p-value becomes smaller as the regressioncoefficient becomes larger. However, the absolute ranking will notalways be the same with both p-value and pure magnitude methods,especially when relatively few data points are available to begin within set S.

Finally, as shown at a block 141, the method fixes and toggles certainresidues based on the rankings observed in the operation of block 139.This is essentially the same use of rankings described above for otherembodiments. In one approach, the method fixes the best residues (nowthose with the lowest p-values) and toggles the others (those withhighest p-values).

This method 125 has been shown in silico to perform well. Moreover, thep-value ranking approach naturally deals with single or few instanceresidues: the p-values will generally be higher (worse) because in thebootstrap process, those residues that did not appear often in theoriginal data set will be less likely to get picked up at random. Evenif their coefficients are large, their variability (measured in standarddeviations) will be quite high as well. Intuitively, this is the desiredresult, since those residues that are not well represented (either havenot seen with sufficient frequency or have lower regressioncoefficients) may be good candidates for toggling in the next round oflibrary design.

III. Digital Apparatus and Systems

As should be apparent, embodiments of the present invention employprocesses acting under control of instructions and/or data stored in ortransferred through one or more computer systems. Embodiments of thepresent invention also relate to apparatus for performing theseoperations. Such apparatus may be specially designed and/or constructedfor the required purposes, or it may be a general-purpose computerselectively activated or reconfigured by a computer program and/or datastructure stored in the computer. The processes presented herein are notinherently related to any particular computer or other apparatus. Inparticular, various general-purpose machines may be used with programswritten in accordance with the teachings herein. In some cases, however,it may be more convenient to construct a specialized apparatus toperform the required method operations. A particular structure for avariety of these machines will appear from the description given below.

In addition, embodiments of the present invention relate to computerreadable media or computer program products that include programinstructions and/or data (including data structures) for performingvarious computer-implemented operations. Examples of computer-readablemedia include, but are not limited to, magnetic media such as harddisks, floppy disks, magnetic tape; optical media such as CD-ROM devicesand holographic devices; magneto-optical media; semiconductor memorydevices, and hardware devices that are specially configured to store andperform program instructions, such as read-only memory devices (ROM) andrandom access memory (RAM), and sometimes application-specificintegrated circuits (ASICs), programmable logic devices (PLDs) andsignal transmission media for delivering computer-readable instructions,such as local area networks, wide area networks, and the Internet. Thedata and program instructions of this invention may also be embodied ona carrier wave or other transport medium (e.g., optical lines,electrical lines, and/or airwaves).

Examples of program instructions include both low-level code such asproduced by a compiler, and files containing higher level code that maybe executed by the computer using an interpreter. Further, the programinstructions include machine code, source code and any other code thatdirectly or indirectly controls operation of a computing machine inaccordance with this invention. The code may specify input, output,calculations, conditionals, branches, iterative loops, etc.

In one example, code embodying methods of the invention are embodied ina fixed media or transmissible program component containing logicinstructions and/or data that when loaded into an appropriatelyconfigured computing device causes the device to perform a geneticoperator on one or more character string. FIG. 5 shows an exampledigital device 500 that should be understood to be a logical apparatusthat can read instructions from media 517, network port 519, user inputkeyboard 509, user input 511 or other inputting means. Apparatus 500 canthereafter use those instructions to direct statistical operations indata space, e.g., to construct one or more data set (e.g., to determinea plurality of representative members of the data space). One type oflogical apparatus that can embody the invention is a computer system asin computer system 500 comprising CPU 507, optional user input deviceskeyboard 509, and GUI pointing device 511, as well as peripheralcomponents such as disk drives 515 and monitor 505 (which displays GOmodified character strings and provides for simplified selection ofsubsets of such character strings by a user. Fixed media 517 isoptionally used to program the overall system and can include, e.g., adisk-type optical or magnetic media or other electronic memory storageelement. Communication port 519 can be used to program the system andcan represent any type of communication connection.

The invention can also be embodied within the circuitry of anapplication specific integrated circuit (ASIC) or programmable logicdevice (PLD). In such a case, the invention is embodied in a computerreadable descriptor language that can be used to create an ASIC or PLD.The invention can also be embodied within the circuitry or logicprocessors of a variety of other digital apparatus, such as PDAs, laptopcomputer systems, displays, image editing equipment, etc.

IV. Other Embodiments

While the foregoing invention has been described in some detail forpurposes of clarity and understanding, it will be clear to one skilledin the art from a reading of this disclosure that various changes inform and detail can be made without departing from the true scope of theinvention. For example, all the techniques and apparatus described abovemay be used in various combinations. All publications, patents, patentapplications, or other documents cited in this application areincorporated by reference in their entirety for all purposes to the sameextent as if each individual publication, patent, patent application, orother document were individually indicated to be incorporated byreference for all purposes.

1. A method for identifying amino acid residues for variation in aprotein variant library in order to affect a desired activity, saidmethod comprising: (a) receiving data characterizing a training set of aprotein variant library, wherein the data provides activity and sequenceinformation for each protein variant in the training set; (b) from thedata, developing a sequence-activity model that predicts activity as afunction of amino acid residue type and corresponding position in aprotein sequence, wherein the sequence-activity model includes one ormore non-linear terms, each representing an interaction between two ormore amino acid residues in the protein sequence; and (c) using thesequence-activity model to identify one or more amino acid residues atspecific positions for variation to impact the desired activity.